Predicting Guest Satisfaction for Airbnbs

STAT 301 – Group 41, Airbnb Prices in London, Rome & Budapest

Group Members:

  • Rapeewit Chanprakaisi
  • Renata Lovette
  • Angelyca Purewal
  • Yingqian Shi

Introduction

In recent years, the sharing economy has revolutionized the hospitality industry, with platforms like Airbnb offering travellers diverse and flexible lodging options across the globe. As Airbnb listings continue to grow, understanding the factors influencing guest satisfaction has become increasingly important for both hosts aiming to improve their offerings and for travellers seeking the best possible experience.

This project focuses on Airbnb listings in three major European cities: London, Rome, and Budapest. The dataset provides detailed information on listing attributes, including price, location characteristics, listing type, and host status. In particular, it includes the guest_satisfaction_overall score, a summary measure of how satisfied guests were with their stay. Given the richness of this dataset, we explore whether overall guest satisfaction can be predicted from these features.

While prior literature, such as the studies by Zhu et al. (2020) and Rezazadeh Kalehbasti et al. (2021), has largely focused on predicting Airbnb prices using spatial and textual features, our work shifts the outcome of interest to guest satisfaction. Those analyses found that machine learning models such as XGBoost and SVR were effective for predicting prices, which suggests that similar approaches may also perform well for satisfaction, though the underlying drivers may differ.

Question

We want to determine whether guest satisfaction can be predicted from factors such as listing price, city (London, Rome, or Budapest), location characteristics (e.g., distance from the city centre and from the nearest metro station), and Airbnb listing type (e.g., room type, business listing, multiple-listing status).

Response Variable: guest_satisfaction_overall

Primary focus: Prediction. We are interested in how well the available variables, taken together, predict Airbnb guest satisfaction levels; we do not attempt to establish causal effects.

Data Description

The dataset describes Airbnb listings in three European cities (London, Rome and Budapest), recorded separately for weekdays and weekends. Each file contains 20 variables, including an observation ID, which Gyódi and Nawaro (2021) compiled to identify the determinants of Airbnb prices using spatial econometric methods.

| Variable Name | Data Type | Description |
| --- | --- | --- |
| ...1 | numerical - discrete | observation ID |
| realSum | numerical - continuous | price in Euros for a 2 night stay for 2 people |
| room_type | categorical | type of room (e.g., shared, private) |
| room_shared | boolean | is the room shared? |
| room_private | boolean | is the room private? |
| person_capacity | numerical - discrete | maximum guest capacity in the room |
| host_is_superhost | boolean | does the host have superhost status? |
| multi | binary | does the property have multiple listings? |
| biz | binary | is the property hosted for business purposes? |
| cleanliness_rating | numerical - continuous | cleanliness rating of the listing |
| guest_satisfaction_overall | numerical - continuous | overall guest satisfaction rating |
| bedrooms | numerical - discrete | number of bedrooms |
| dist | numerical - continuous | distance from the city centre in km |
| metro_dist | numerical - continuous | distance from the nearest metro station in km |
| attr_index | numerical - continuous | attraction index of Airbnb location |
| attr_index_norm | numerical - continuous | normalised attraction index of Airbnb location (0-100) |
| rest_index | numerical - continuous | restaurant index of Airbnb location |
| rest_index_norm | numerical - continuous | normalised restaurant index of Airbnb location (0-100) |
| lng | numerical - continuous | longitude of Airbnb location |
| lat | numerical - continuous | latitude of Airbnb location |

The number of observations for Airbnbs in each city is:

  • London: 4614 (weekdays) and 5379 (weekends)
  • Rome: 4492 (weekdays) and 4535 (weekends)
  • Budapest: 2074 (weekdays) and 1948 (weekends)

The data source and citation as requested by the owner(s) is

Gyódi, K., & Nawaro, Ł. (2021). Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4446043

Methods and Results

In [34]:
library(tidyverse)
library(repr)
library(ggplot2)
library(patchwork)
library(caret)
library(glmnet)
library(rsample)
library(car)
options(repr.plot.width = 16, repr.plot.height = 10)

Exploratory Data Analysis (EDA)

Read data

In [61]:
# Loading in the 6 data sets 
budapest_weekdays <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/budapest_weekdays.csv")
budapest_weekends <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/budapest_weekends.csv")
london_weekdays <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/london_weekdays.csv")
london_weekends <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/london_weekends.csv")
rome_weekdays <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/rome_weekdays.csv")
rome_weekends <- read_csv("https://raw.githubusercontent.com/rchanpra/stat-301-project/refs/heads/main/data/rome_weekends.csv")
New names:
• `` -> `...1`
Rows: 2074 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1948 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4614 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 5379 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4492 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4535 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Wrangling and Cleaning

In [62]:
# Main developer: Angelyca Purewal 
# Contributors: Rapeewit Chanprakaisi

# Adding two new columns to each of the datasets 
budapest_weekdays_data <- budapest_weekdays %>% mutate(city = "Budapest", time = "Weekday")
budapest_weekends_data <- budapest_weekends %>% mutate(city = "Budapest", time = "Weekend") 
london_weekdays_data <- london_weekdays %>% mutate(city = "London", time = "Weekday")
london_weekends_data <- london_weekends %>% mutate(city = "London", time = "Weekend") 
rome_weekdays_data <- rome_weekdays %>% mutate(city = "Rome", time = "Weekday")
rome_weekends_data <- rome_weekends %>% mutate(city = "Rome", time = "Weekend") 

# Combining all the data sets into a tidy format
airbnb_data <- bind_rows(budapest_weekdays_data, budapest_weekends_data,
                         london_weekdays_data, london_weekends_data,
                         rome_weekdays_data, rome_weekends_data)

# convert boolean columns to binary (1/0)
airbnb_data <- airbnb_data %>%
    mutate(across(c(room_shared, room_private, host_is_superhost), as.numeric))

head(airbnb_data)
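A tibble: 6 × 22
| ...1 | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | ⋯ | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | city | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 0 | 238.9905 | Entire home/apt | 0 | 0 | 6 | 1 | 0 | 1 | 10 | ⋯ | 0.3593550 | 0.3526430 | 404.4047 | 24.116552 | 893.4773 | 67.65685 | 19.05074 | 47.50076 | Budapest | Weekday |
| 1 | 300.7943 | Entire home/apt | 0 | 0 | 6 | 0 | 0 | 1 | 9 | ⋯ | 0.9294272 | 0.2002355 | 1676.8760 | 100.000000 | 452.5397 | 34.26770 | 19.04493 | 47.50405 | Budapest | Weekday |
| 2 | 162.3819 | Entire home/apt | 0 | 0 | 4 | 1 | 0 | 0 | 10 | ⋯ | 2.4508403 | 0.2794518 | 163.5885 | 9.755551 | 191.9923 | 14.53825 | 19.02170 | 47.49882 | Budapest | Weekday |
| 3 | 118.4377 | Entire home/apt | 0 | 0 | 2 | 0 | 0 | 0 | 9 | ⋯ | 1.5594494 | 0.4779711 | 191.7198 | 11.433155 | 326.2156 | 24.70205 | 19.06301 | 47.51126 | Budapest | Weekday |
| 4 | 134.4174 | Entire home/apt | 0 | 0 | 4 | 1 | 1 | 0 | 10 | ⋯ | 1.1138030 | 0.2701016 | 198.6035 | 11.843658 | 635.5159 | 48.12322 | 19.06900 | 47.49900 | Budapest | Weekday |
| 5 | 127.3676 | Entire home/apt | 0 | 0 | 4 | 0 | 1 | 0 | 9 | ⋯ | 0.2684703 | 0.1669317 | 635.6350 | 37.905903 | 1005.6535 | 76.15118 | 19.05480 | 47.50094 | Budapest | Weekday |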
In [63]:
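# Check for missing values: complete.cases() flags rows without any NA,
# so this counts rows (not individual cells) that contain a missing value
cat('\nData has', sum(!complete.cases(airbnb_data)), 'rows with missing values.')
Data has 0 rows with missing values.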

Pre-selection of variables

Let's start by dropping the ...1 column, which acts as an identifier, as well as the room_shared and room_private columns. Because the six files have been merged into one, the ...1 values no longer identify rows uniquely, so the column carries no useful information. Additionally, room_type already categorizes each listing as a private room, a shared room, or an entire home/apt, which makes the two indicator columns redundant. It is unclear whether room_private and room_shared were meant to capture a different aspect of the room, so we assume they represent the same information.

We'll also drop lng and lat because we haven't learned how to use spatial analysis variables.

In [64]:
# Main developer: Renata Lovette
# Contributors: Angelyca Purewal
In [65]:
# drop the ID, 'room_shared', 'room_private', 'lng' and 'lat' columns
airbnb_data <- airbnb_data %>% select(-c("...1", room_shared, room_private, lng, lat))
head(airbnb_data)
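A tibble: 6 × 17
| realSum | room_type | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | city | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 238.9905 | Entire home/apt | 6 | 1 | 0 | 1 | 10 | 99 | 1 | 0.3593550 | 0.3526430 | 404.4047 | 24.116552 | 893.4773 | 67.65685 | Budapest | Weekday |
| 300.7943 | Entire home/apt | 6 | 0 | 0 | 1 | 9 | 98 | 2 | 0.9294272 | 0.2002355 | 1676.8760 | 100.000000 | 452.5397 | 34.26770 | Budapest | Weekday |
| 162.3819 | Entire home/apt | 4 | 1 | 0 | 0 | 10 | 98 | 1 | 2.4508403 | 0.2794518 | 163.5885 | 9.755551 | 191.9923 | 14.53825 | Budapest | Weekday |
| 118.4377 | Entire home/apt | 2 | 0 | 0 | 0 | 9 | 92 | 1 | 1.5594494 | 0.4779711 | 191.7198 | 11.433155 | 326.2156 | 24.70205 | Budapest | Weekday |
| 134.4174 | Entire home/apt | 4 | 1 | 1 | 0 | 10 | 99 | 2 | 1.1138030 | 0.2701016 | 198.6035 | 11.843658 | 635.5159 | 48.12322 | Budapest | Weekday |
| 127.3676 | Entire home/apt | 4 | 0 | 1 | 0 | 9 | 91 | 2 | 0.2684703 | 0.1669317 | 635.6350 | 37.905903 | 1005.6535 | 76.15118 | Budapest | Weekday |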
In [66]:
# check observations in categorical levels

# roomtype per city
rm_type_per_city_count <- airbnb_data %>%
    group_by(city) %>%
    count(room_type)

rm_type_per_city_count
A grouped_df: 9 × 3
| city | room_type | n |
| --- | --- | --- |
| <chr> | <chr> | <int> |
| Budapest | Entire home/apt | 3589 |
| Budapest | Private room | 419 |
| Budapest | Shared room | 14 |
| London | Entire home/apt | 4384 |
| London | Private room | 5559 |
| London | Shared room | 50 |
| Rome | Entire home/apt | 5561 |
| Rome | Private room | 3454 |
| Rome | Shared room | 12 |

From the table above, "Shared room" has very few observations in comparison to the other room types (76 listings across all three cities). If we kept these observations, we could encounter problems when splitting the data into training and testing sets: a rare level can end up almost entirely in the testing set, leaving the model with too few "Shared room" examples to learn from. Thus, we drop this categorical level and remove the corresponding observations.

In [67]:
# remove observations that include room_type = "Shared room"

airbnb_data <- airbnb_data %>% filter(room_type != "Shared room")

rm_type_per_city_count_2 <- airbnb_data %>%
    group_by(city) %>%
    count(room_type)

rm_type_per_city_count_2
A grouped_df: 6 × 3
| city | room_type | n |
| --- | --- | --- |
| <chr> | <chr> | <int> |
| Budapest | Entire home/apt | 3589 |
| Budapest | Private room | 419 |
| London | Entire home/apt | 4384 |
| London | Private room | 5559 |
| Rome | Entire home/apt | 5561 |
| Rome | Private room | 3454 |

Finally, let's also remove rest_index and attr_index. Their normalized versions are already included in the dataset to provide a more meaningful and standardized measure, making the raw versions redundant.

In [68]:
airbnb_data <- airbnb_data %>% select(-rest_index, -attr_index)
head(airbnb_data)
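A tibble: 6 × 15
| realSum | room_type | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index_norm | rest_index_norm | city | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 238.9905 | Entire home/apt | 6 | 1 | 0 | 1 | 10 | 99 | 1 | 0.3593550 | 0.3526430 | 24.116552 | 67.65685 | Budapest | Weekday |
| 300.7943 | Entire home/apt | 6 | 0 | 0 | 1 | 9 | 98 | 2 | 0.9294272 | 0.2002355 | 100.000000 | 34.26770 | Budapest | Weekday |
| 162.3819 | Entire home/apt | 4 | 1 | 0 | 0 | 10 | 98 | 1 | 2.4508403 | 0.2794518 | 9.755551 | 14.53825 | Budapest | Weekday |
| 118.4377 | Entire home/apt | 2 | 0 | 0 | 0 | 9 | 92 | 1 | 1.5594494 | 0.4779711 | 11.433155 | 24.70205 | Budapest | Weekday |
| 134.4174 | Entire home/apt | 4 | 1 | 1 | 0 | 10 | 99 | 2 | 1.1138030 | 0.2701016 | 11.843658 | 48.12322 | Budapest | Weekday |
| 127.3676 | Entire home/apt | 4 | 0 | 1 | 0 | 9 | 91 | 2 | 0.2684703 | 0.1669317 | 37.905903 | 76.15118 | Budapest | Weekday |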
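
As a quick sanity check on why the raw indices are redundant, the six Budapest weekday rows printed earlier are consistent with attr_index_norm being attr_index divided by its largest value in the file and multiplied by 100. The sketch below re-types those six values by hand. Treating 1676.876 as the file maximum is an assumption, suggested by its normalised value of exactly 100, and has not been verified on the full data.

```r
# Values copied by hand from the first six Budapest weekday rows shown earlier
attr_index      <- c(404.4047, 1676.8760, 163.5885, 191.7198, 198.6035, 635.6350)
attr_index_norm <- c(24.116552, 100.000000, 9.755551, 11.433155, 11.843658, 37.905903)

# Assumption: the normalised index is the raw index rescaled so that the
# largest value in the file becomes 100
rescaled <- attr_index / max(attr_index) * 100

# Largest absolute gap between the rescaled and the reported values
max_gap <- max(abs(rescaled - attr_index_norm))
max_gap
```

If this rescaling holds, the raw and normalised indices carry the same information within each file, which supports keeping only one of the two.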

Visualization

In [69]:
# realSum (the listing price) has extreme values, so it is plotted on a log scale:
# the log compresses large values while maintaining the overall trend.
# geom_jitter() spreads the points to reduce overlap, and alpha = 0.3 lowers the
# opacity so that denser areas are more visible.
scatter_plot <- ggplot(airbnb_data, aes(x = log(realSum), y = guest_satisfaction_overall, color = city, shape = room_type)) +
  geom_jitter(alpha = 0.3, width = 0.3, height = 0.3) +
  labs(x = "Price (Logged, EUR)",
       y = "Guest Satisfaction",
       color = "City",
       shape = "Room Types",
       title = "How Price, City, and Room Type Affect Guest Satisfaction") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(size = 14, face = "bold"))


box_plot <- ggplot(airbnb_data, aes(x = city, y = guest_satisfaction_overall, fill = city)) +
  geom_boxplot(outlier.shape = 16, outlier.colour = "red", outlier.size = 3, alpha = 0.5) +  
  labs(x = "City", y = "Guest Satisfaction", title = "City and Guest Satisfaction") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "pink", "lightgreen"))

box_plot2 <- ggplot(airbnb_data, aes(x = room_type, y = guest_satisfaction_overall, fill = room_type)) +
  geom_boxplot(outlier.shape = 16, outlier.colour = "red", outlier.size = 3, alpha = 0.5) +  
  labs(x = "Room Type", y = "Guest Satisfaction", title = "Room Type and Guest Satisfaction") +
  scale_fill_manual(values = c("orange", "purple")) + 
  theme_minimal()

box_plot + box_plot2 + scatter_plot
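[Figure: box plots of guest satisfaction by city and by room type, alongside a scatter plot of guest satisfaction against logged price, coloured by city and shaped by room type]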

Visualization Explanation

We created a scatterplot to explore all four variables together (price, city, room type, and guest satisfaction), and two box plots to explore the relationship between each categorical variable and guest satisfaction. The box plots show that the median guest satisfaction differs between cities, but the distributions overlap substantially. The two room types have very similar medians and also overlap heavily. This suggests that neither city nor room type alone separates high-satisfaction listings from low-satisfaction ones. The scatterplot gives a big-picture view of the data: the points for the different cities and room types overlap heavily, so no single one of these variables visibly separates satisfied from dissatisfied guests.

Methods Plan & Computational Code

Methods Plan

Method(s) of interest: Multiple linear regression (MLR) with Lasso for variable selection

To develop a predictive model for guest satisfaction, a continuous variable, we will use multiple predictors such as the price of the listing, city (London, Rome, or Budapest), location (e.g., distance from the city centre), and listing characteristics (e.g., room type, business listing status, number of bedrooms). Lasso regression will be used because it penalizes large coefficients and can shrink some of them exactly to zero, which performs variable selection and can improve model interpretability and generalizability.

Before applying Lasso, we will check for multicollinearity using the Variance Inflation Factor (VIF). VIF values higher than 5 or 10 suggest multicollinearity issues that might require addressing. After checking VIF, we will split the data into a training set (80%) and a testing set (20%). The training data will be standardized and used to fit the Lasso regression model via the glmnet function. The optimal lambda (penalty term) will be selected via cross-validation. Finally, the model's performance will be evaluated using the testing set, and its prediction accuracy will be assessed by computing the Root Mean Squared Error (RMSE).
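
To make the VIF check concrete, here is a minimal sketch of how the statistic can be computed by hand. It uses the built-in mtcars data as a stand-in rather than the Airbnb data, and manual_vif is a name introduced only for this illustration. For predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors.

```r
# Illustrative only: built-in mtcars data, not the Airbnb data
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

# For each predictor, regress it on the remaining predictors and
# convert the auxiliary R-squared into a variance inflation factor
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(manual_vif, 3)
```

For a factor with more than two levels, such as city, car::vif() reports a generalized VIF instead. In that case GVIF^(1/(2*Df)) is the comparable quantity, and it is on the scale of the square root of an ordinary VIF.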

Assumptions

To use an MLR model, we need to assume the following:

  • Linearity: The relationship between the dependent variable (guest_satisfaction_overall) and each independent variable should be linear.
  • Independence: Observations should be independent of each other. We assume that the dataset includes a random sample of Airbnb listings.
  • Homoscedasticity: The variance of residuals should remain constant across predictions.
  • Normality of Residuals: Residuals should follow a normal distribution
  • No extreme outliers
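
The assumptions above are usually checked with residual diagnostics. The following is a minimal sketch on a stand-in model fitted to the built-in mtcars data, not on the Airbnb model. The object names fit and diag_df and the cutoff of 3 for standardized residuals are illustrative choices, not part of our analysis.

```r
# Illustrative only: a stand-in model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted    = fitted(fit),
  resid     = resid(fit),
  std_resid = rstandard(fit)
)

# Residuals vs fitted: curvature hints at non-linearity,
# a funnel shape hints at non-constant variance
plot(diag_df$fitted, diag_df$resid, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the standardized residuals
qqnorm(diag_df$std_resid)
qqline(diag_df$std_resid)

# Observations with large standardized residuals (rule-of-thumb cutoff of 3)
outlier_rows <- which(abs(diag_df$std_resid) > 3)
outlier_rows
```

The same calls could be applied to the fitted Airbnb model in place of fit once it has been estimated.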

Potential limitations & weaknesses

  • Skewed distribution: Airbnb guest satisfaction ratings tend to cluster around high values and around multiples of 10, leading to a skewed distribution of the response variable. This may violate the normality assumption.

  • Multicollinearity: Some predictors (e.g., location-based features) might be highly correlated, requiring ridge or lasso regression to improve model stability. Multicollinearity occurs when there is a strong association between two or more covariates. These covariates bring similar information to the model, and we have trouble isolating their effect. During the EDA stage, we've removed room_shared and room_private because they measured the same attribute. There may be other variables that will contribute to multicollinearity, so we will use VIF to check for highly correlated predictors.

Lasso will be used to mitigate the previous limitation; however, this method adds its own limitations:

  • Lasso can shrink coefficients of predictors too much, sometimes pushing important predictors to zero. This could result in an overly simplified model, losing useful information for prediction or interpretation.
  • Lasso introduces bias into the model coefficients, especially for highly correlated predictors. While this helps in reducing overfitting, it can distort the true relationship between predictors and the outcome.
  • Lasso assumes a linear relationship and does not automatically account for interaction terms between predictors. If there are important interactions, Lasso may not capture them unless explicitly included in the model.
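
As an illustration of the last point, interaction terms can be added to the design matrix explicitly before it is passed to a Lasso fit. This is a sketch on a small made-up data frame whose column names mirror ours; the values are invented, and the choice of a cleanliness_rating by room_type interaction is hypothetical.

```r
# Illustrative only: toy values, with column names mirroring the project data
toy <- data.frame(
  guest_satisfaction_overall = c(95, 88, 92, 79, 99, 85),
  cleanliness_rating = c(10, 9, 10, 8, 10, 9),
  room_type = factor(c("Entire home/apt", "Private room", "Entire home/apt",
                       "Private room", "Entire home/apt", "Private room")),
  city = factor(c("Budapest", "London", "Rome", "Budapest", "London", "Rome"))
)

# Main effects plus one explicit interaction term
X_int <- model.matrix(
  guest_satisfaction_overall ~ . + cleanliness_rating:room_type,
  data = toy
)[, -1]  # drop the intercept column, since glmnet fits its own intercept

colnames(X_int)
```

The resulting matrix has one extra column for the interaction, and Lasso can then keep or drop it like any other predictor.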
In [70]:
# Main developer: Angelyca Purewal
# Contributors: Renata Lovette
In [94]:
# set seed for reproducibility
set.seed(123)

# split the data into training (80%) and testing (20%) sets
airbnb_split <-
    airbnb_data %>%
    initial_split(prop = 0.8, strata = guest_satisfaction_overall)

train_data <- training(airbnb_split)
test_data <- testing(airbnb_split)
In [95]:
# fit model with all predictors
mlr_model <- lm(guest_satisfaction_overall ~ ., data = train_data)  # Fit model with all predictors
summary(mlr_model)
Call:
lm(formula = guest_satisfaction_overall ~ ., data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-77.709  -2.143   0.604   3.163  34.427 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           31.3788115  0.5404645  58.059  < 2e-16 ***
realSum                0.0002273  0.0001553   1.464 0.143194    
room_typePrivate room  0.5235988  0.1232109   4.250 2.15e-05 ***
person_capacity       -0.0099721  0.0519793  -0.192 0.847864    
host_is_superhost      1.5669630  0.1118002  14.016  < 2e-16 ***
multi                 -0.7798785  0.1183055  -6.592 4.46e-11 ***
biz                   -2.5321455  0.1203530 -21.039  < 2e-16 ***
cleanliness_rating     6.6375584  0.0491870 134.945  < 2e-16 ***
bedrooms               0.3665806  0.0992950   3.692 0.000223 ***
dist                  -0.0144839  0.0377551  -0.384 0.701258    
metro_dist             0.2896782  0.0636307   4.552 5.34e-06 ***
attr_index_norm       -0.0032485  0.0081587  -0.398 0.690510    
rest_index_norm        0.0095233  0.0050614   1.882 0.059911 .  
cityLondon            -1.6897013  0.2488910  -6.789 1.16e-11 ***
cityRome              -1.8308295  0.1436378 -12.746  < 2e-16 ***
timeWeekend           -0.0326030  0.0927098  -0.352 0.725092    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.265 on 18356 degrees of freedom
Multiple R-squared:  0.5678,	Adjusted R-squared:  0.5674 
F-statistic:  1607 on 15 and 18356 DF,  p-value: < 2.2e-16
In [96]:
# use VIF as an initial check for multicollinearity; values higher than 5 or 10 suggest it might be problematic
vif(mlr_model) %>% round(3)
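A matrix: 14 × 3 of type dbl
| | GVIF | Df | GVIF^(1/(2*Df)) |
| --- | --- | --- | --- |
| realSum | 1.285 | 1 | 1.133 |
| room_type | 1.724 | 1 | 1.313 |
| person_capacity | 2.160 | 1 | 1.470 |
| host_is_superhost | 1.134 | 1 | 1.065 |
| multi | 1.433 | 1 | 1.197 |
| biz | 1.561 | 1 | 1.249 |
| cleanliness_rating | 1.120 | 1 | 1.058 |
| bedrooms | 1.592 | 1 | 1.262 |
| dist | 4.495 | 1 | 2.120 |
| metro_dist | 1.881 | 1 | 1.371 |
| attr_index_norm | 3.391 | 1 | 1.842 |
| rest_index_norm | 2.818 | 1 | 1.679 |
| city | 5.315 | 2 | 1.518 |
| time | 1.005 | 1 | 1.003 |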

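The table above shows low to moderate multicollinearity: the largest GVIF values belong to city (5.315) and dist (4.495), and every GVIF^(1/(2*Df)) value is below 2.2. We will therefore keep all predictors and use Lasso for further variable selection.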

In [111]:
X_train <- model.matrix(guest_satisfaction_overall ~ ., data = train_data)[, -1]  # Remove intercept
y_train <- train_data$guest_satisfaction_overall
X_test <- model.matrix(guest_satisfaction_overall ~ ., data = test_data)[, -1]
y_test <- test_data$guest_satisfaction_overall

# Standardize predictors
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test, center = attr(X_train_scaled, "scaled:center"), scale = attr(X_train_scaled, "scaled:scale"))

# Perform Lasso Regression with cross-validation
cv_lasso <- cv.glmnet(X_train_scaled, y_train, alpha = 1)  # Lasso (alpha = 1)

# Get the best lambda
best_lambda_lasso <- cv_lasso$lambda.min

# Train the final Lasso model using best lambda
lasso_model <- glmnet(X_train_scaled, y_train, alpha = 1, lambda = best_lambda_lasso)
In [112]:
# see variables selected by Lasso
lasso_coefs <- coef(lasso_model, s = best_lambda_lasso)
lasso_coefs
16 x 1 sparse Matrix of class "dgCMatrix"
                                s1
(Intercept)           92.317276290
realSum                0.049296255
room_typePrivate room  0.226381498
person_capacity        .          
host_is_superhost      0.682992564
multi                 -0.335509143
biz                   -1.191989922
cleanliness_rating     6.595315479
bedrooms               0.195707769
dist                   .          
metro_dist             0.261768019
attr_index_norm        .          
rest_index_norm        0.127797759
cityLondon            -0.818598691
cityRome              -0.848788478
timeWeekend           -0.003277905

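Lasso has shrunk the coefficients of person_capacity, dist and attr_index_norm to 0. Because the predictors were standardized before fitting, the remaining coefficients are expressed per standard deviation of each predictor and are not directly comparable to the OLS estimates above. We will now compare the performance of the Lasso model to the full MLR model.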

In [116]:
test_pred_full <- predict(mlr_model, newdata = test_data)

# compute RMSE of the full predictive model
airbnb_test_RMSEs <- tibble(
  Model = "OLS Full Regression",
  RMSE = mltools::rmse(
    preds = test_pred_full,
    actuals = test_data$guest_satisfaction_overall
  )
)

# compute RMSE of lasso predictive model
test_pred_lasso <- predict(lasso_model, s = best_lambda_lasso, newx = X_test_scaled)

airbnb_test_RMSEs <- rbind(
  airbnb_test_RMSEs,
  tibble(
    Model = "Lasso with minimum MSE",
    RMSE = mltools::rmse(
       preds = test_pred_lasso,
       actuals = y_test)
  )
)
airbnb_test_RMSEs
A tibble: 2 × 2
| Model | RMSE |
| --- | --- |
| <chr> | <dbl> |
| OLS Full Regression | 6.127537 |
| Lasso with minimum MSE | 6.127711 |

Analysing RMSE values

The test RMSE values of the full MLR model (6.127537) and the Lasso model (6.127711) are nearly identical; the difference of 0.000174 is practically negligible. The Lasso model set 3 coefficients to zero without any meaningful loss in accuracy, which suggests that those predictors add little predictive information and that our data does not suffer much from multicollinearity.
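
An RMSE is easier to interpret next to a naive baseline. The sketch below defines a small base-R rmse helper, which is a hypothetical stand-in for mltools::rmse, and compares a set of made-up predictions with a baseline that always predicts the mean. The numbers are toy values, not our results.

```r
# Illustrative only: toy vectors, not the project's predictions
rmse <- function(preds, actuals) {
  sqrt(mean((actuals - preds)^2))
}

actuals <- c(100, 95, 90, 80, 98)
preds   <- c(96, 94, 93, 88, 95)

model_rmse <- rmse(preds, actuals)

# Naive baseline: always predict the mean of the response
# (in a real analysis this would be the mean of the training responses)
baseline_rmse <- rmse(rep(mean(actuals), length(actuals)), actuals)

c(model = model_rmse, baseline = baseline_rmse)
```

A model is only useful for prediction to the extent that its RMSE falls clearly below this kind of baseline.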

Visualizing Predictions

In [119]:
# Main developer: Angelyca Purewal

# Create a data frame with actual and predicted values
# (test_pred_lasso holds the Lasso predictions computed above)
prediction_data <- data.frame(Actual = test_data$guest_satisfaction_overall, 
                              s1 = as.numeric(test_pred_lasso))

# Scatter plot: Predicted vs Actual
predicted_actual_plot <- ggplot(prediction_data, aes(x = Actual, y = s1)) +
  # jitter spreads the points to reduce overlap; a separate geom_point() layer
  # would draw every observation a second time, so only one layer is used
  geom_jitter(color = "blue", alpha = 0.3, width = 0.3, height = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +  # Line of perfect prediction
  labs(title = "Predicted vs Actual Guest Satisfaction", 
       x = "Actual Guest Satisfaction", 
       y = "Predicted Guest Satisfaction") +
  theme_minimal() +
  theme(legend.position = "right", 
  plot.title = element_text(size = 14, face = "bold"))

predicted_actual_plot
[Figure: scatter plot of predicted against actual guest satisfaction on the test set, with a dashed red line marking perfect prediction]

Visualization Interpretation

If this model were perfect, all points would fall on the red line; instead, the points are spread around the line, indicating prediction error. The model predicts high guest satisfaction fairly well because most of the training data consists of high ratings. For the same reason, it struggles to predict low satisfaction scores and often overestimates them. The predictions are also compressed toward the average. This is expected of any regression model that explains only part of the variance in the response, and the Lasso penalty adds a small amount of further shrinkage. To improve, we would need to address the scarcity of low-rated listings.
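
The compression of predictions toward the average can be illustrated with ordinary least squares alone. For an OLS fit with an intercept, the variance of the fitted values equals R-squared times the variance of the response, so predictions are always less spread out than the observations unless the fit is perfect. The sketch below checks this on a stand-in model using the built-in mtcars data, not on our Airbnb model.

```r
# Illustrative only: a stand-in OLS model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

r_squared      <- summary(fit)$r.squared
variance_ratio <- var(fitted(fit)) / var(mtcars$mpg)

# For OLS with an intercept, these two quantities coincide
c(r_squared = r_squared, variance_ratio = variance_ratio)
```

With a training R-squared of roughly 0.57, as reported in the model summary above, a similar narrowing of the predicted range is to be expected for the satisfaction model.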

Discussion

This analysis demonstrated that guest satisfaction on Airbnb can be partially predicted using listing characteristics and location data. Cleanliness rating was by far the strongest predictor, followed by business-listing status, city, and superhost status, while price and most location-based variables contributed little: Lasso removed dist and attr_index_norm entirely, and the price coefficient was not significant in the full model. Several challenges and limitations affected the model's accuracy. In particular, guest satisfaction is concentrated at the high end of the scale, with many listings having high ratings, which complicates the prediction of lower scores.

The model provided a reasonable fit for most observations, especially those with satisfaction scores clustered around the average. Lasso regression offered some interpretability by automatically selecting relevant variables, although its test RMSE (about 6.13) was essentially the same as that of the full model, so it did not noticeably reduce overfitting here. Together with a training R-squared of about 0.57, this suggests that our model has moderate predictive power for new listings.

Our results emphasise that satisfaction is multifactorial, associated not only with cleanliness but also with host status, city, and room features. This is consistent with Xu and Gursoy (2020), who highlight the importance of cleanliness and amenities in increasing guest loyalty and satisfaction. For hosts, improving cleanliness may improve the guest experience, although our predictive model cannot establish this causally. For Airbnb as a platform, refining satisfaction metrics, for example by collecting more detailed feedback or using a broader rating scale, may support better model training and platform trust.

Limitations

The guest satisfaction variable has a left-skewed distribution, with most values concentrated near 100. This limits the model's ability to distinguish between average and excellent listings and to identify factors contributing to dissatisfaction. In addition, only a few listings received low satisfaction scores, which limits the model's ability to learn patterns from negative feedback. This imbalance in the response variable may also bias model evaluation, because the overall RMSE is dominated by the many high-rated listings.

To improve model performance, we suggest implementing ensemble models like Random Forest or Gradient Boosted Trees, which can better capture nonlinear relationships and interaction effects.

Future Research

Future research could incorporate temporal features such as booking dates, seasons, and day-of-week effects to analyze how guest satisfaction varies over time. For example, satisfaction may dip during peak tourist seasons due to higher prices or stretched host resources, while winter off-season listings might show higher satisfaction due to quieter environments and better host availability. Including time-based variables could uncover seasonal patterns and improve model accuracy by capturing temporal dynamics often overlooked in static models.

References

Airbnb prices in European cities. (2024, March 10). Kaggle. https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities

Gyódi, K., & Nawaro, Ł. (2021). Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material) [Dataset]. In Zenodo (CERN European Organization for Nuclear Research). https://doi.org/10.5281/zenodo.4437019

Rezazadeh Kalehbasti, P., Nikolenko, L., & Rezaei, H. (2021). Airbnb price prediction using machine learning and sentiment analysis. In E. Weippl, A. Holzinger, A. M. Tjoa & P. Kieseberg (Eds.), Machine learning and knowledge extraction (pp. 173-184). Springer International Publishing AG. https://doi.org/10.1007/978-3-030-84060-0_11

Xu, X., & Gursoy, D. (2020). Exploring the relationship between servicescape, place attachment, and intention to recommend accommodations marketed through sharing economy platforms. Journal of Travel & Tourism Marketing, 37(4), 429–446. https://doi.org/10.1080/10548408.2020.1784365
